Alejandro Schuler and David Connell
2022
Adapted from Steve Bagley and based on R for Data Science by Hadley Wickham
By the end of the course you should be able to…
If you haven't already, please open RStudio on DataHub by clicking this link. If you're viewing this on bCourses, you'll have to right click and then choose “Open Link in New Tab”.
You will get more out of this tutorial if you try out these things in R yourself!!
The R console window is the left (or lower-left) window in RStudio. The R console uses a “read, eval, print” loop. This is sometimes called a REPL.
> 1 + 2
[1] 3
3 is the answer.
[1] means: the answer is a vector (a list of elements of the same type) and this line starts with the first element of that vector.
> 1 +2
> 1+ 2
> 1+2
> 1 + 2
These all do the same thing. The result of each line is 3:
[1] 3
> 1 + 2 * 3 # R respects order of operations
[1] 7
> 3/4
[1] 0.75
> 6^3
[1] 216
> log(10) # natural log
[1] 2.302585
> log10(10) # log base 10
[1] 1
> sqrt(16)
[1] 4
> c(2.1, -4, 22)
[1] 2.1 -4.0 22.0
Vectors like this one are built with the c( ) function, which is short for “combine”.
> 1:50
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
[26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
: is a handy shortcut to create a vector that is a sequence of integers from the first number to the second number (inclusive).
Note the [ ] notation: the second output line starts with [26], which marks the 26th element of the vector.
An operation is elementwise (or element-wise) if it acts on each element of a vector separately, producing a vector with the same dimensions as the original.
The code below multiplies each element of 1:10 by the corresponding
element of 1:10, that is, it squares each element.
> (1:10)*(1:10)
[1] 1 4 9 16 25 36 49 64 81 100
> (1:10)^2
[1] 1 4 9 16 25 36 49 64 81 100
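As a small additional sketch of elementwise arithmetic (the vectors a and b are made up for illustration), two vectors of the same length are combined position by position, and the result has that same length:

```r
# Hypothetical example vectors, not from the course data
a <- c(1, 2, 3)
b <- c(10, 20, 30)

# Elementwise addition and multiplication pair up matching positions
elementwise_sum <- a + b      # 1+10, 2+20, 3+30
elementwise_product <- a * b  # 1*10, 2*20, 3*30

print(elementwise_sum)
print(elementwise_product)

# The result has the same length as the inputs
print(length(elementwise_sum) == length(a))
```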
The colon operator : has a higher precedence than addition +.
> 1 + 0:10
[1] 1 2 3 4 5 6 7 8 9 10 11
> 0:10 + 1 # which operator gets executed first?
[1] 1 2 3 4 5 6 7 8 9 10 11
> (0:10) + 1
[1] 1 2 3 4 5 6 7 8 9 10 11
> 0:(10 + 1)
[1] 0 1 2 3 4 5 6 7 8 9 10 11
> x <- 10
> x
[1] 10
> x / 5
[1] 2
/ is the division operator.
In R, there are (unfortunately) two assignment operators. They have subtly different meanings (more details later).
<- requires that you type two characters. Don't put a space between < and -. (What would happen?)
In RStudio, use “Option -” (Mac) or “Alt -” (PC) to type this using one key combination.
= is easier to type.
> x <- 10
> x
[1] 10
> x = 20
> x
[1] 20
Many R users prefer <- to reduce confusion with the comparison operator == (more on that later).
> x <- 10
> x
[1] 10
> x <- x + 1
> x
[1] 11
Choose descriptive names rather than using x and y everywhere.
Avoid excessively long names (e.g., Main.database.first.object.header.length).
See ?make.names for the complete rules on what can be a name.
> a <- 1
> A # this causes an error because A does not have a value
Error: object 'A' not found
There are different conventions for constructing compound names. Warning: disputes over the right way to do this can get heated.
stringlength
string.length
StringLength (CamelCase)
stringLength
string_length (underscore or underbar a.k.a. snake_case)
string-length (hyphen a.k.a. kebab-case)
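One way to check which of these spellings R would accept as a name is to compare each candidate with what make.names() returns for it: a syntactically valid name comes back unchanged. This is only a rough sketch, and the candidates vector simply repeats the list above:

```r
# The naming styles listed above, as plain strings
candidates <- c("stringlength", "string.length", "StringLength",
                "stringLength", "string_length", "string-length")

# make.names() rewrites anything that is not a valid name,
# so a candidate is valid exactly when it is returned unchanged
is_valid <- make.names(candidates) == candidates
names(is_valid) <- candidates

print(is_valid)
```

Under this check only the hyphenated form is rejected, because R reads string-length as string minus length.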
> for <- 7 # this causes an error
for is a reserved word in R. (It is used in loop control.)
See ?Reserved for the complete rules.
> my_age_end_of_year = 31
> this_year = 2022
> my_birth_year = this_year - my_age_end_of_year
> my_birth_year
[1] 1991
Source: OOMPH course PHW251 - R for Public Health
> sqrt(2)
[1] 1.414214
> sqrt(0:10)
[1] 0.000000 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751
[9] 2.828427 3.000000 3.162278
> x <- 4
> sqrt(x)
[1] 2
> x
[1] 4
> y <- sqrt(x)
> y
[1] 2
> x <- 10
> y
[1] 2
What is the value of y after changing the value of x?
Functions do not change their inputs (x remains the same after sqrt(x)).
Once you assign a variable (y), it keeps its value until updated, even if you change other variables (x) that went into the original assignment of that variable.
> sum
Type sum, then hit the TAB key (or just wait a second) to see the available completions for sum.
Use RETURN or ENTER to select the current item.
Type ?name for help on name. Example:
> ?log
This opens the documentation for the log function (and related functions) in the Help pane, including the name and meaning of the arguments and returned values.
> ?"+"
This opens the documentation for the + operator.
> weights <- c(1.1, 2.2, 3.3)
> # this divides the weights, element-wise, by the conversion factor:
> weights / 2.2
[1] 0.5 1.0 1.5
> shoesize <- c(9, 12, 6, 10, 10, 16, 8, 4)
> shoesize
[1] 9 12 6 10 10 16 8 4
> sum(shoesize)
[1] 75
> sum(shoesize)/length(shoesize)
[1] 9.375
> mean(shoesize)
[1] 9.375
> x <- c(7, 3, 1, 9)
Exercise: subtract the mean of x from x, and then sum
the result.
> x <- c(7, 3, 1, 9)
> mean(x)
[1] 5
> x - mean(x)
[1] 2 -2 -4 4
> sum(x - mean(x)) # answer in one expression
[1] 0
> m <- 13
> se <- 0.25
Given the values of m (mean) and se (standard error), construct a vector containing the two values \( m \pm 2 \times se \). The expected result is:
[1] 12.5 13.5
> ## one way:
> c(m - 2*se, m + 2*se)
[1] 12.5 13.5
> ## another way:
> m + c(-2, 2)*se
[1] 12.5 13.5
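The second approach generalizes nicely. As a sketch, it can be wrapped in a small helper. The function name interval_around and its argument k are made up here for illustration and are not part of the course material:

```r
# Hypothetical helper: returns c(m - k*se, m + k*se)
interval_around <- function(m, se, k = 2) {
  # c(-k, k) * se is computed elementwise, then m is added to both values
  m + c(-k, k) * se
}

result <- interval_around(13, 0.25)
print(result)
```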
> 1:5
[1] 1 2 3 4 5
> seq(1,5)
[1] 1 2 3 4 5
seq is the function equivalent of the colon operator.
> seq(from = 1, to = 5)
[1] 1 2 3 4 5
> seq(to = 5, from = 1)
[1] 1 2 3 4 5
Arguments can be given by name, in the form name = value.
Do not use <- in place of = when specifying
named arguments.
> seq(1, 5)
[1] 1 2 3 4 5
> seq(from = 1, to = 5)
[1] 1 2 3 4 5
> seq(begin = 1, end = 5)
Warning: In seq.default(begin = 1, end = 5) :
extra arguments 'begin', 'end' will be disregarded
[1] 1
> ## Try this:
> ?seq
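Named arguments become more useful once a function has several optional arguments. A brief sketch using two documented arguments of seq, by and length.out (the particular numbers are arbitrary):

```r
# Step from 0 to 1 in increments of 0.25
by_quarters <- seq(from = 0, to = 1, by = 0.25)
print(by_quarters)

# Ask for exactly 4 evenly spaced values between 1 and 10
four_points <- seq(from = 1, to = 10, length.out = 4)
print(four_points)

# Named arguments can be given in any order
same_points <- seq(length.out = 4, to = 10, from = 1)
print(identical(four_points, same_points))
```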
> install.packages("name_of_package")
Try this now:
> install.packages("tidyverse")
Use the library function to load an installed package.
Despite its name, library loads a package,
not a library.
> library("name_of_package")
> ?filter # returns documentation for a function called filter in the stats package
> library(dplyr)
> ?filter # now returns documentation for a function called filter in the dplyr package!
To say which package's function you mean, put the package name and :: before the function name.
> ?stats::filter
> ?dplyr::filter
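The same package::function form also works when calling a function, not only when asking for help. As a sketch that needs only packages shipped with R, the call below names stats::filter explicitly to compute a 3-point moving average, so it behaves the same whether or not dplyr is loaded (the input 1:5 is arbitrary):

```r
# stats::filter applies a linear filter; equal weights of 1/3 give a
# centered 3-point moving average, with NA at both ends
moving_avg <- stats::filter(1:5, rep(1/3, 3))

print(as.numeric(moving_avg))
```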
Type factorial(1:10) in the script editor, then run that line with Command-RETURN (Mac), or Ctrl-ENTER (Windows).
> ## This is a comment
> 1 + 2 # add some numbers
[1] 3
Use # to start a comment.
> install.packages("tidyverse")
> library("tidyverse")
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
✓ tibble 3.1.5 ✓ dplyr 1.0.7
✓ tidyr 1.1.4 ✓ stringr 1.4.0
✓ readr 2.0.2 ✓ forcats 0.5.1
✓ purrr 0.3.4
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
x dplyr::filter() masks stats::filter()
x dplyr::lag() masks stats::lag()
Put library("tidyverse") at the top of every script file…
A data frame is one of the most powerful features in R.
> mtc
# A tibble: 32 × 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
# … with 22 more rows
A tibble is a kind of data frame. This one has 32 rows and 11 columns. We only see the first 10 rows because of limited slide/screen space.
The type of each column appears under its name: <dbl> means double-precision floating point number, which is a computer science term for any number that can have a decimal point in it (e.g. 1.3333, 3.14159, 1.0).
> mtc = read_csv("https://tinyurl.com/mtcars-csv")
read_csv (from the readr package, part of tidyverse) reads in data frames that are stored in .csv files (.csv = comma-separated values).
It also reads files stored locally, given their path: read_csv("path/to/file/mtcars.csv")
See ?read_csv to learn a bit more.
CSV files can also be exported from databases and saved locally to be read into R, although later we will learn how to communicate directly with a database to pull data into R.
Use tibble() to make your own data frames from scratch in R.
> my_data = tibble( # newlines don't do anything, just increase clarity
+ mrn = c(1, 2, 3, 4),
+ age = c(33, 48, 8, 29)
+ )
> my_data
# A tibble: 4 × 2
mrn age
<dbl> <dbl>
1 1 33
2 2 48
3 3 8
4 4 29
dim() gives the dimensions of the data frame. ncol() and nrow() give you the number of columns and the number of rows, respectively.
> dim(my_data)
[1] 4 2
> ncol(my_data)
[1] 2
> nrow(my_data)
[1] 4
names() gives you the names of the columns (a vector).
> names(my_data)
[1] "mrn" "age"
glimpse() shows every column with its type and first few values; head() returns the first n rows.
> glimpse(my_data)
Rows: 4
Columns: 2
$ mrn <dbl> 1, 2, 3, 4
$ age <dbl> 33, 48, 8, 29
> head(my_data, n=2)
# A tibble: 2 × 2
mrn age
<dbl> <dbl>
1 1 33
2 2 48
The rest of this section shows the basic data frame functions (“verbs”) in the dplyr package (part of tidyverse). Each operation takes a data frame and produces a new data frame.
filter() picks out rows according to specified conditions
select() picks out columns according to their names
arrange() sorts the rows by values in some column(s)
mutate() creates new columns, often based on operations on other columns
summarize() collapses many values in one or more columns down to one value per column
These can all be used in conjunction with group_by() which changes the scope of each function from operating on the entire dataset to operating on it group-by-group. These six functions provide the verbs for a language of data manipulation.
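All verbs work similarly: the first argument is a data frame, the remaining arguments describe what to do with it (referring to columns by their unquoted names), and the result is a new data frame.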
Together these properties make it easy to chain together multiple simple steps to achieve a complex result. Let’s dive in and see how these verbs work.
> filter(mtc, mpg >= 25)
# A tibble: 6 × 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 32.4 4 78.7 66 4.08 2.2 19.5 1 1 4 1
2 30.4 4 75.7 52 4.93 1.62 18.5 1 1 4 2
3 33.9 4 71.1 65 4.22 1.84 19.9 1 1 4 1
4 27.3 4 79 66 4.08 1.94 18.9 1 1 4 1
5 26 4 120. 91 4.43 2.14 16.7 0 1 5 2
6 30.4 4 95.1 113 3.77 1.51 16.9 1 1 5 2
> filter(mtc, mpg >= 25, qsec < 19)
# A tibble: 4 × 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 30.4 4 75.7 52 4.93 1.62 18.5 1 1 4 2
2 27.3 4 79 66 4.08 1.94 18.9 1 1 4 1
3 26 4 120. 91 4.43 2.14 16.7 0 1 5 2
4 30.4 4 95.1 113 3.77 1.51 16.9 1 1 5 2
> filter(mtc, mpg > 60)
# A tibble: 0 × 11
# … with 11 variables: mpg <dbl>, cyl <dbl>, disp <dbl>, hp <dbl>, drat <dbl>,
# wt <dbl>, qsec <dbl>, vs <dbl>, am <dbl>, gear <dbl>, carb <dbl>
== tests for equality (do not use =)
The operators > and < test for greater-than and less-than
The operators >= and <= are greater-than-or-equal and less-than-or-equal
> c(1,5,-22,4) > 0
[1] TRUE TRUE FALSE TRUE
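Comparisons like this produce a logical vector, which is what filter() uses to decide which rows to keep. A minimal base-R sketch of the same idea on a plain vector (the variable names are made up for illustration, and the square-bracket indexing is not otherwise covered in this section):

```r
x <- c(1, 5, -22, 4)

# One TRUE/FALSE per element
is_positive <- x > 0

# TRUE counts as 1 and FALSE as 0, so sum() counts the matches
n_positive <- sum(is_positive)

# Indexing with a logical vector keeps only the TRUE positions
positives <- x[is_positive]

print(n_positive)
print(positives)
```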
> filter(mtc, mpg > 30 | mpg < 20)
# A tibble: 22 × 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
2 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
3 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
4 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
5 17.8 6 168. 123 3.92 3.44 18.9 1 0 4 4
6 16.4 8 276. 180 3.07 4.07 17.4 0 0 3 3
7 17.3 8 276. 180 3.07 3.73 17.6 0 0 3 3
8 15.2 8 276. 180 3.07 3.78 18 0 0 3 3
9 10.4 8 472 205 2.93 5.25 18.0 0 0 3 4
10 10.4 8 460 215 3 5.42 17.8 0 0 3 4
# … with 12 more rows
| stands for OR, & is AND.
Separating conditions with a comma, as in the earlier examples, is the same as using & inside filter().
> filter(mtc, !(mpg > 30 | mpg < 20))
# A tibble: 10 × 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
5 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
6 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
7 21.5 4 120. 97 3.7 2.46 20.0 1 0 3 1
8 27.3 4 79 66 4.08 1.94 18.9 1 1 4 1
9 26 4 120. 91 4.43 2.14 16.7 0 1 5 2
10 21.4 4 121 109 4.11 2.78 18.6 1 1 4 2
! is NOT, which negates the logical condition.
> filter(mtc, cyl %in% c(6,8)) # equivalent to filter(mtc, cyl==6 | cyl==8)
# A tibble: 21 × 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
4 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
5 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
6 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
7 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
8 17.8 6 168. 123 3.92 3.44 18.9 1 0 4 4
9 16.4 8 276. 180 3.07 4.07 17.4 0 0 3 3
10 17.3 8 276. 180 3.07 3.73 17.6 0 0 3 3
# … with 11 more rows
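To see what %in% does on its own, outside of filter(), here is a small sketch on a plain vector (the values are chosen to mirror the cyl column but are typed in by hand):

```r
# Hypothetical stand-in for a few values of the cyl column
cyl_values <- c(4, 6, 8, 4)

# One TRUE/FALSE per element on the left-hand side
in_set <- cyl_values %in% c(6, 8)

# The same test written with == and |
same_as_or <- (cyl_values == 6) | (cyl_values == 8)

print(in_set)
print(identical(in_set, same_as_or))
```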
%in% returns TRUE for each element of the thing on the left that is also an element of the thing on the right.
Which cars have horsepower (hp) greater than 200?
> filter(mtc, hp > 200)
# A tibble: 7 × 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
2 10.4 8 472 205 2.93 5.25 18.0 0 0 3 4
3 10.4 8 460 215 3 5.42 17.8 0 0 3 4
4 14.7 8 440 230 3.23 5.34 17.4 0 0 3 4
5 13.3 8 350 245 3.73 3.84 15.4 0 0 3 4
6 15.8 8 351 264 4.22 3.17 14.5 0 1 5 4
7 15 8 301 335 3.54 3.57 14.6 0 1 5 8
Find the cars with mpg between 15 and 20.
> filter(mtc, mpg > 15, mpg < 20)
# A tibble: 12 × 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
2 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
3 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
4 17.8 6 168. 123 3.92 3.44 18.9 1 0 4 4
5 16.4 8 276. 180 3.07 4.07 17.4 0 0 3 3
6 17.3 8 276. 180 3.07 3.73 17.6 0 0 3 3
7 15.2 8 276. 180 3.07 3.78 18 0 0 3 3
8 15.5 8 318 150 2.76 3.52 16.9 0 0 3 2
9 15.2 8 304 150 3.15 3.44 17.3 0 0 3 2
10 19.2 8 400 175 3.08 3.84 17.0 0 0 3 2
11 15.8 8 351 264 4.22 3.17 14.5 0 1 5 4
12 19.7 6 145 175 3.62 2.77 15.5 0 1 5 6
> filter(mtc, row_number()<=3)
# A tibble: 3 × 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
Use row_number() to get specific rows. This is more useful once you have sorted the data in a particular order, which we will soon see how to do.
> sample_n(mtc, 5)
# A tibble: 5 × 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21.4 4 121 109 4.11 2.78 18.6 1 1 4 2
2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 19.2 8 400 175 3.08 3.84 17.0 0 0 3 2
4 17.3 8 276. 180 3.07 3.73 17.6 0 0 3 3
5 15.5 8 318 150 2.76 3.52 16.9 0 0 3 2
Use sample_n() to get n randomly selected rows if you don't have a particular condition you would like to filter on.
sample_frac() is similar.
See ?sample_n to see how you can sample with replacement or with weights.
> select(mtc, mpg, qsec, wt)
# A tibble: 32 × 3
mpg qsec wt
<dbl> <dbl> <dbl>
1 21 16.5 2.62
2 21 17.0 2.88
3 22.8 18.6 2.32
4 21.4 19.4 3.22
5 18.7 17.0 3.44
6 18.1 20.2 3.46
7 14.3 15.8 3.57
8 24.4 20 3.19
9 22.8 22.9 3.15
10 19.2 18.3 3.44
# … with 22 more rows
select() can also be used with handy helpers like starts_with() and contains().
> select(mtc, starts_with("m"))
# A tibble: 32 × 1
mpg
<dbl>
1 21
2 21
3 22.8
4 21.4
5 18.7
6 18.1
7 14.3
8 24.4
9 22.8
10 19.2
# … with 22 more rows
> select(mtc, hp, contains("m"))
# A tibble: 32 × 3
hp mpg am
<dbl> <dbl> <dbl>
1 110 21 1
2 110 21 1
3 93 22.8 1
4 110 21.4 0
5 175 18.7 0
6 105 18.1 0
7 245 14.3 0
8 62 24.4 0
9 95 22.8 0
10 123 19.2 0
# … with 22 more rows
The quotes around "m" make it a character string (or string for short). If we did not do this, R would think it was looking for a variable called m and not just the plain letter.
We don't have to quote the names of columns (like hp) because the tidyverse functions know that we are working within the data frame and thus treat the column names like they are variables in their own right.
select() can also be used to select everything except for certain columns.
> select(mtc, -contains("m"), -hp)
# A tibble: 32 × 8
cyl disp drat wt qsec vs gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 6 160 3.9 2.62 16.5 0 4 4
2 6 160 3.9 2.88 17.0 0 4 4
3 4 108 3.85 2.32 18.6 1 4 1
4 6 258 3.08 3.22 19.4 1 3 1
5 8 360 3.15 3.44 17.0 0 3 2
6 6 225 2.76 3.46 20.2 1 3 1
7 8 360 3.21 3.57 15.8 0 3 4
8 4 147. 3.69 3.19 20 1 4 2
9 4 141. 3.92 3.15 22.9 1 4 2
10 6 168. 3.92 3.44 18.3 1 4 4
# … with 22 more rows
select() has a friend called pull() which returns a vector instead of a (one-column) data frame.
> select(mtc, hp)
# A tibble: 32 × 1
hp
<dbl>
1 110
2 110
3 93
4 110
5 175
6 105
7 245
8 62
9 95
10 123
# … with 22 more rows
> pull(mtc, hp)
[1] 110 110 93 110 175 105 245 62 95 123 123 180 180 180 205 215 230 66 52
[20] 65 97 150 150 245 175 66 91 113 264 175 335 109
> filter(mtc, row_number()==1)
# A tibble: 1 × 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
> head(mtc)
# A tibble: 6 × 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
select() and filter() are functions, so they do not modify their input. You can see mtc is unchanged after calling filter() on it. This holds for functions in general.
> mtc_first_row = filter(mtc, row_number()==1)
> mtc_first_row
# A tibble: 1 × 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
> # tmp = select(mtc, mpg, qsec, wt)
> # filter(tmp, mpg >= 25)
> filter(select(mtc, mpg, qsec, wt), mpg >= 25)
# A tibble: 6 × 3
mpg qsec wt
<dbl> <dbl> <dbl>
1 32.4 19.5 2.2
2 30.4 18.5 1.62
3 33.9 19.9 1.84
4 27.3 18.9 1.94
5 26 16.7 2.14
6 30.4 16.9 1.51
arrange takes a data frame and a column, and sorts the rows by the values in that column (ascending order).
> powerful <- filter(mtc, hp > 200)
> arrange(powerful, mpg)
# A tibble: 7 × 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 10.4 8 472 205 2.93 5.25 18.0 0 0 3 4
2 10.4 8 460 215 3 5.42 17.8 0 0 3 4
3 13.3 8 350 245 3.73 3.84 15.4 0 0 3 4
4 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
5 14.7 8 440 230 3.23 5.34 17.4 0 0 3 4
6 15 8 301 335 3.54 3.57 14.6 0 1 5 8
7 15.8 8 351 264 4.22 3.17 14.5 0 1 5 4
> arrange(powerful, gear, disp)
# A tibble: 7 × 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 13.3 8 350 245 3.73 3.84 15.4 0 0 3 4
2 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
3 14.7 8 440 230 3.23 5.34 17.4 0 0 3 4
4 10.4 8 460 215 3 5.42 17.8 0 0 3 4
5 10.4 8 472 205 2.93 5.25 18.0 0 0 3 4
6 15 8 301 335 3.54 3.57 14.6 0 1 5 8
7 15.8 8 351 264 4.22 3.17 14.5 0 1 5 4
> arrange(powerful, desc(mpg))
# A tibble: 7 × 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 15.8 8 351 264 4.22 3.17 14.5 0 1 5 4
2 15 8 301 335 3.54 3.57 14.6 0 1 5 8
3 14.7 8 440 230 3.23 5.34 17.4 0 0 3 4
4 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
5 13.3 8 350 245 3.73 3.84 15.4 0 0 3 4
6 10.4 8 472 205 2.93 5.25 18.0 0 0 3 4
7 10.4 8 460 215 3 5.42 17.8 0 0 3 4
Use arrange() and filter() to get the data for the 5 cars with the highest mpg.
> filter(arrange(mtc, desc(mpg)), row_number()<=5) # "nesting" the calls to filter and arrange
# A tibble: 5 × 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 33.9 4 71.1 65 4.22 1.84 19.9 1 1 4 1
2 32.4 4 78.7 66 4.08 2.2 19.5 1 1 4 1
3 30.4 4 75.7 52 4.93 1.62 18.5 1 1 4 2
4 30.4 4 95.1 113 3.77 1.51 16.9 1 1 5 2
5 27.3 4 79 66 4.08 1.94 18.9 1 1 4 1
or
> cars_by_mpg = arrange(mtc, desc(mpg)) # using a temporary variable
> filter(cars_by_mpg, row_number()<=5)
# A tibble: 5 × 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 33.9 4 71.1 65 4.22 1.84 19.9 1 1 4 1
2 32.4 4 78.7 66 4.08 2.2 19.5 1 1 4 1
3 30.4 4 75.7 52 4.93 1.62 18.5 1 1 4 2
4 30.4 4 95.1 113 3.77 1.51 16.9 1 1 5 2
5 27.3 4 79 66 4.08 1.94 18.9 1 1 4 1
> mtc_vars_subset = select(mtc, mpg, hp)
> mutate(mtc_vars_subset, gpm = 1/mpg)
# A tibble: 32 × 3
mpg hp gpm
<dbl> <dbl> <dbl>
1 21 110 0.0476
2 21 110 0.0476
3 22.8 93 0.0439
4 21.4 110 0.0467
5 18.7 175 0.0535
6 18.1 105 0.0552
7 14.3 245 0.0699
8 24.4 62 0.0410
9 22.8 95 0.0439
10 19.2 123 0.0521
# … with 22 more rows
This uses mutate to add a new column which is the reciprocal of mpg.
The thing on the left of the = is a new name that you make up which you would like the new column to be called.
The expression on the right of the = defines what will go into the new column.
mutate() can create multiple columns at the same time and use multiple columns to define a single new one.
> mutate(mtc_vars_subset, # the newlines make it more readable
+ gpm = 1/mpg,
+ mpg_hp_ratio = mpg/hp)
# A tibble: 32 × 4
mpg hp gpm mpg_hp_ratio
<dbl> <dbl> <dbl> <dbl>
1 21 110 0.0476 0.191
2 21 110 0.0476 0.191
3 22.8 93 0.0439 0.245
4 21.4 110 0.0467 0.195
5 18.7 175 0.0535 0.107
6 18.1 105 0.0552 0.172
7 14.3 245 0.0699 0.0584
8 24.4 62 0.0410 0.394
9 22.8 95 0.0439 0.24
10 19.2 123 0.0521 0.156
# … with 22 more rows
mtc_vars_subset is unchanged after the mutate.
> df = tibble(number = c("1", "2", "3"))
> mutate(df, number_plus_1 = number + 1)
Error: Problem with `mutate()` column `number_plus_1`.
ℹ `number_plus_1 = number + 1`.
x non-numeric argument to binary operator
mutate() is also useful for converting data types, in this case text to numbers.
> mutate(df, number = as.numeric(number))
# A tibble: 3 × 1
number
<dbl>
1 1
2 2
3 3
> summarize(mtc, mpg_avg=mean(mpg))
# A tibble: 1 × 1
mpg_avg
<dbl>
1 20.1
summarize() boils the data frame down using the summary expressions it is given. In this case, it creates a data frame with a single column called mpg_avg that contains the mean of the mpg column.
> summarize(mtc, # newlines not necessary, again just increase clarity
+ mpg_avg = mean(mpg),
+ mpg_2x_max = max(2*mpg),
+ hp_mpg_ratio_min = min(hp/mpg))
# A tibble: 1 × 3
mpg_avg mpg_2x_max hp_mpg_ratio_min
<dbl> <dbl> <dbl>
1 20.1 67.8 1.71
filter() picks out rows according to specified conditions
select() picks out columns according to their names
arrange() sorts the rows by values in some column(s)
mutate() creates new columns, often based on operations on other columns
summarize() collapses many values in one or more columns down to one value per column
All verbs work similarly: each takes a data frame as its first argument and returns a new data frame.
Together these properties make it easy to chain together multiple simple steps to achieve a complex result.
First, let's load in some new data.
> data1 <- read_csv("http://stanford.edu/~sbagley2/bios205/data/data1.csv")
Error in open.connection(structure(4L, class = c("curl", "connection"), conn_id = <pointer: 0x271>), : HTTP error 404.
> data1
Error in eval(expr, envir, enclos): object 'data1' not found
<chr> is short for “character string”, which means text data.
> data1_by_gender <- group_by(data1, gender)
Error in group_by(data1, gender): object 'data1' not found
> data1_by_gender
Error in eval(expr, envir, enclos): object 'data1_by_gender' not found
R knows that this is really two sub-data-frames (one for each group) instead of one.
> summarize(data1_by_gender, mean_weight = mean(weight))
Error in summarize(data1_by_gender, mean_weight = mean(weight)): object 'data1_by_gender' not found
summarize() works the same as before, except now it returns two rows instead of one because there are two groups that were defined by group_by(gender).
> data1_by_gender_and_shoesize = group_by(data1, gender, shoesize)
Error in group_by(data1, gender, shoesize): object 'data1' not found
> summarize(data1_by_gender_and_shoesize,
+ mean_weight = mean(weight),
+ mean_age = mean(age))
Error in summarize(data1_by_gender_and_shoesize, mean_weight = mean(weight), : object 'data1_by_gender_and_shoesize' not found
Both gender and shoesize appear as columns in the result.
There is one row for each combination of gender and shoesize in the original data.
The n() function counts the number of rows in each group:
> summarize(data1_by_gender, count = n())
Error in summarize(data1_by_gender, count = n()): object 'data1_by_gender' not found
The n_distinct() function counts the number of distinct (unique) values in the specified column:
> summarize(data1_by_gender, n_sizes = n_distinct(shoesize))
Error in summarize(data1_by_gender, n_sizes = n_distinct(shoesize)): object 'data1_by_gender' not found
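Because the download of data1 fails above, here is a self-contained sketch of the same grouped summaries on a tiny made-up tibble. The data frame toy and all of its values are invented for illustration; only the column names mirror those used in this section:

```r
library(dplyr)

# Hypothetical data: five people, two groups
toy <- tibble(
  gender   = c("F", "F", "M", "M", "M"),
  weight   = c(60, 70, 80, 90, 100),
  shoesize = c(7, 7, 10, 11, 10)
)

# Mark the rows as belonging to one group per gender
toy_by_gender <- group_by(toy, gender)

# Each summary expression is evaluated once per group
toy_summary <- summarize(
  toy_by_gender,
  mean_weight = mean(weight),
  count       = n(),
  n_sizes     = n_distinct(shoesize)
)

print(toy_summary)
```

The result has one row per group, so two rows here.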
distinct() filters out any duplicate rows in a data frame. The equivalent for vectors is unique().
> state_data <- read_csv("http://stanford.edu/~sbagley2/bios205/data/state_data.csv")
Error in open.connection(structure(5L, class = c("curl", "connection"), conn_id = <pointer: 0x27b>), : HTTP error 404.
> state_data
Error in eval(expr, envir, enclos): object 'state_data' not found
> state_data_by_region <- group_by(state_data, region)
Error in group_by(state_data, region): object 'state_data' not found
> summarize(state_data_by_region, n_states = n())
Error in summarize(state_data_by_region, n_states = n()): object 'state_data_by_region' not found
filter() the grouped data in data1_by_gender to pick out the rows for the youngest male and female (hint: use min() and ==).
> filter(data1_by_gender, age==min(age))
Error in filter(data1_by_gender, age == min(age)): object 'data1_by_gender' not found
In the nested form below, the result of summarize is used as an argument to arrange.
The calls run from the inside out: group_by, then summarize, then arrange.
> arrange(summarize(group_by(state_data, region), sd_area = sd(area)), sd_area)
> state_data_by_region <- group_by(state_data, region)
Error in group_by(state_data, region): object 'state_data' not found
> region_area_sds <- summarize(state_data_by_region, sd_area = sd(area))
Error in summarize(state_data_by_region, sd_area = sd(area)): object 'state_data_by_region' not found
> arrange(region_area_sds, sd_area)
Error in arrange(region_area_sds, sd_area): object 'region_area_sds' not found
The pipe operator, %>%, lets you “pipe” the result from the first
function call to the second function call.
> state_data %>%
+ group_by(region) %>%
+ summarize(sd_area = sd(area)) %>%
+ arrange(sd_area)
Error in group_by(., region): object 'state_data' not found
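Read the pipeline from top to bottom: start with state_data, group it by region, compute sd_area for each region, then sort by sd_area.
> df1 %>% fun(x)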
is converted into:
> fun(df1, x)
That is, the value on the left of %>% becomes the first argument of fun.
> c(1,44,21,0,-4) %>%
+ sum()
[1] 62
> sum(c(1,44,21,0,-4))
[1] 62
> 1 %>% `+`(1) # `+` is just a function that takes two arguments!
[1] 2
Use the . syntax to send the argument elsewhere:
> values = c(1,2,3,NA)
> TRUE %>%
+ mean(values, na.rm=.)
[1] 2
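To make the equivalence concrete, the sketch below computes the same quantity twice, once with nested calls and once with a chain of pipes. The vector numbers is made up for illustration, and loading dplyr is just one way to make %>% available:

```r
library(dplyr)  # makes the %>% operator available

# Hypothetical input with one missing value
numbers <- c(1, 4, 9, NA)

# Nested form: read from the inside out
nested <- round(mean(sqrt(numbers), na.rm = TRUE), 1)

# Piped form: read from left to right; each result becomes
# the first argument of the next call
piped <- numbers %>%
  sqrt() %>%
  mean(na.rm = TRUE) %>%
  round(1)

print(nested)
print(identical(nested, piped))
```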
ggplot2 is a very powerful graphics package.
It is part of the tidyverse.
> install.packages("ggplot2")
> library("ggplot2")
> ggplot(data = mtc, mapping = aes(x = hp, y = mpg)) +
+ geom_point()
Although the package is named ggplot2, the function is called simply ggplot().
> ggplot(data = mtc, mapping = aes(x = hp, y = mpg)) + geom_point()
data = mtc: this tells which tibble contains the data to be plotted
mapping = aes(x = hp, y = mpg): use the data in the hp column on x-axis, mpg column on y-axis
geom_point(): plot the data as points
The same plot, with the argument names omitted:
> ggplot(mtc, aes(hp, mpg)) +
+ geom_point()
> ggplot(mtc, aes(hp, mpg)) +
+ geom_line()
> ggplot(mtc, aes(hp, mpg)) +
+ geom_point() +
+ geom_smooth(method="lm")
`geom_smooth()` using formula 'y ~ x'
"lm" means “linear model,” which is a least-squares regression line.
> ggplot(mtc, aes(hp, mpg)) +
+ geom_point() +
+ geom_smooth(method="loess")
`geom_smooth()` using formula 'y ~ x'
> mtc %>% # with the pipe
+ ggplot(aes(hp, mpg)) +
+ geom_point() +
+ geom_smooth(method="loess", se=FALSE)
`geom_smooth()` using formula 'y ~ x'
se = FALSE means do not plot the confidence band (using the standard error).
gender is a discrete variable, with two values.
> data1 %>%
+ group_by(gender) %>%
+ summarize(mean_age=mean(age), mean_weight=mean(weight)) %>%
+ ggplot(aes(gender, mean_weight)) +
+ geom_col()
Error in group_by(., gender): object 'data1' not found
geom_col() is used to make a bar plot. The height of each bar is the value for that group.
ggplot2 differs from most plotting systems: it is based on the idea of a “grammar of graphics,” a set of primitives and rules for combining them in a way that makes sense for plotting data.
Use aes to map from variables (columns in data frame) to aesthetics (visual properties of the plot): x, y, color, size, shape, and others.
Choose a geom. This determines the type of the plot: point (a scatterplot), line (line graph or line chart), bar (barplot), and others.
Choose a stat (statistical transformation): often identity (do no transformation), but can be used to count, bin, or summarize data (e.g., in a histogram).
Choose a scale. This converts from the units used in the data frame to the units used for display.
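As a sketch of how these pieces fit together in one call, the plot below names each component explicitly. It uses R's built-in mtcars data frame rather than the mtc tibble so that it runs on its own; mapping color to the number of cylinders and putting hp on a log scale are arbitrary choices for illustration:

```r
library(ggplot2)

p <- ggplot(mtcars,
            aes(x = hp, y = mpg, color = factor(cyl))) +  # aes: columns -> x, y, color
  geom_point() +            # geom: draw each row as a point
  scale_x_log10() +         # scale: show hp on a log10 axis
  labs(color = "cylinders") # legend title for the color aesthetic
```

Typing p at the console (or calling print(p)) draws the plot.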
Use ggplot to look for a linear relationship between hp and 1/mpg in our mtc data.
> ggplot(mtc, aes(hp, 1/mpg)) +
+ geom_point() +
+ geom_smooth(method="lm", se=FALSE)
`geom_smooth()` using formula 'y ~ x'
> mtc %>%
+ mutate(gpm = 1/mpg) %>%
+ ggplot(aes(hp, gpm)) +
+ geom_point() +
+ geom_smooth(method="lm", se=FALSE)
> orange <- as_tibble(Orange) # this data is pre-loaded into R
> orange %>%
+ filter(Tree == 2) %>%
+ ggplot(aes(age, circumference)) +
+ geom_point()
Now plot only the measurements with age > 1000.
> orange %>%
+ filter(Tree == 2, age > 1000) %>%
+ ggplot(aes(age, circumference)) +
+ geom_point()
Add a new column called circum_in which is the circumference in inches, not in millimeters.
> mutate(orange, circum_in = circumference/(10 * 2.54))
# A tibble: 35 × 4
Tree age circumference circum_in
<ord> <dbl> <dbl> <dbl>
1 1 118 30 1.18
2 1 484 58 2.28
3 1 664 87 3.43
4 1 1004 115 4.53
5 1 1231 120 4.72
6 1 1372 142 5.59
7 1 1582 145 5.71
8 2 118 33 1.30
9 2 484 69 2.72
10 2 664 111 4.37
# … with 25 more rows
Use the state_data data frame for this exercise.
> state_data %>%
+ group_by(region) %>%
+ summarize(mean_area = mean(area)) %>%
+ arrange(desc(mean_area))
Error in group_by(., region): object 'state_data' not found
> state_data %>%
+ group_by(region) %>%
+ summarize(area_range = max(area) - min(area)) %>%
+ arrange(area_range)
Error in group_by(., region): object 'state_data' not found
> state_data2 <- state_data %>%
+ group_by(region) %>%
+ mutate(region_mean = mean(area))
Error in group_by(., region): object 'state_data' not found
> state_data2
Error in eval(expr, envir, enclos): object 'state_data2' not found
The region_mean column has 50 values, one for each state, depending on the region the state is in.
> state_data2 %>%
+ mutate(diff = abs(area-region_mean)) %>%
+ filter(diff == min(diff))
Error in mutate(., diff = abs(area - region_mean)): object 'state_data2' not found
Use ungroup() to undo the group_by() so that the filter() is applied across the whole data frame and not region-by-region.
> state_data2 %>%
+ mutate(diff = abs(area-region_mean)) %>%
+ ungroup() %>%
+ filter(diff == min(diff))
Error in mutate(., diff = abs(area - region_mean)): object 'state_data2' not found
> state_data %>%
+ group_by(region) %>%
+ filter(area == min(area))
Error in group_by(., region): object 'state_data' not found
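Since state_data could not be loaded above, the sketch below reproduces the grouped-versus-ungrouped distinction on a tiny invented tibble. The data frame toy_states, its six "states", and their areas are all made up for illustration:

```r
library(dplyr)

# Hypothetical data: three states in each of two regions
toy_states <- tibble(
  state  = c("A", "B", "C", "D", "E", "F"),
  region = c("East", "East", "East", "West", "West", "West"),
  area   = c(10, 20, 60, 5, 45, 70)
)

# Grouped mutate: region_mean is computed separately for each region
with_diff <- toy_states %>%
  group_by(region) %>%
  mutate(region_mean = mean(area),
         diff = abs(area - region_mean))

# Still grouped: min(diff) is taken within each region
closest_per_region <- with_diff %>%
  filter(diff == min(diff))

# Ungrouped: min(diff) is taken over the whole data frame
closest_overall <- with_diff %>%
  ungroup() %>%
  filter(diff == min(diff))

print(closest_per_region)
print(closest_overall)
```

With these numbers the grouped filter keeps one state per region (B and E), while the ungrouped filter keeps only E.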
> # install.packages("nycflights13")
> library(nycflights13)
> head(flights)
# A tibble: 6 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
6 2013 1 1 554 558 -4 740 728
# … with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
> head(airports)
# A tibble: 6 × 8
faa name lat lon alt tz dst tzone
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 04G Lansdowne Airport 41.1 -80.6 1044 -5 A America/Ne…
2 06A Moton Field Municipal Airport 32.5 -85.7 264 -6 A America/Ch…
3 06C Schaumburg Regional 42.0 -88.1 801 -6 A America/Ch…
4 06N Randall Airport 41.4 -74.4 523 -5 A America/Ne…
5 09J Jekyll Island Airport 31.1 -81.4 11 -5 A America/Ne…
6 0A9 Elizabethton Municipal Airport 36.4 -82.2 1593 -5 A America/Ne…
> head(planes)
# A tibble: 6 × 9
tailnum year type manufacturer model engines seats speed engine
<chr> <int> <chr> <chr> <chr> <int> <int> <int> <chr>
1 N10156 2004 Fixed wing mu… EMBRAER EMB-1… 2 55 NA Turbo-…
2 N102UW 1998 Fixed wing mu… AIRBUS INDUST… A320-… 2 182 NA Turbo-…
3 N103US 1999 Fixed wing mu… AIRBUS INDUST… A320-… 2 182 NA Turbo-…
4 N104UW 1999 Fixed wing mu… AIRBUS INDUST… A320-… 2 182 NA Turbo-…
5 N10575 2002 Fixed wing mu… EMBRAER EMB-1… 2 55 NA Turbo-…
6 N105UW 1999 Fixed wing mu… AIRBUS INDUST… A320-… 2 182 NA Turbo-…
> head(weather)
# A tibble: 6 × 15
origin year month day hour temp dewp humid wind_dir wind_speed wind_gust
<chr> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 EWR 2013 1 1 1 39.0 26.1 59.4 270 10.4 NA
2 EWR 2013 1 1 2 39.0 27.0 61.6 250 8.06 NA
3 EWR 2013 1 1 3 39.0 28.0 64.4 240 11.5 NA
4 EWR 2013 1 1 4 39.9 28.0 62.2 250 12.7 NA
5 EWR 2013 1 1 5 39.0 28.0 64.4 260 12.7 NA
6 EWR 2013 1 1 6 37.9 28.0 67.2 240 11.5 NA
# … with 4 more variables: precip <dbl>, pressure <dbl>, visib <dbl>,
# time_hour <dttm>
flights connects to planes via a single variable, tailnum.
flights connects to airlines through the carrier variable.
flights connects to airports in two ways: via the origin and dest variables.
flights connects to weather via origin (the location), and year, month, day and hour (the time).
> flights %>%
+ select(tailnum, origin, dest, carrier) %>%
+ inner_join(airlines, by="carrier")
# A tibble: 336,776 × 5
tailnum origin dest carrier name
<chr> <chr> <chr> <chr> <chr>
1 N14228 EWR IAH UA United Air Lines Inc.
2 N24211 LGA IAH UA United Air Lines Inc.
3 N619AA JFK MIA AA American Airlines Inc.
4 N804JB JFK BQN B6 JetBlue Airways
5 N668DN LGA ATL DL Delta Air Lines Inc.
6 N39463 EWR ORD UA United Air Lines Inc.
7 N516JB EWR FLL B6 JetBlue Airways
8 N829AS LGA IAD EV ExpressJet Airlines Inc.
9 N593JB JFK MCO B6 JetBlue Airways
10 N3ALAA LGA ORD AA American Airlines Inc.
# … with 336,766 more rows
> x <- tibble(
+ key = c(1,2,3),
+ val_x = c("x1","x2","x3")
+ )
> y <- tibble(
+ key = c(1,2,4),
+ val_y = c("y1","y2","y3")
+ )
> inner_join(x, y, by="key")
# A tibble: 2 × 3
key val_x val_y
<dbl> <chr> <chr>
1 1 x1 y1
2 2 x2 y2
The column to join on is specified with by="column".
> x <- tibble(
+ key = c(1,2,2,3),
+ val_x = c("x1","x2","x3","x4")
+ )
> y <- tibble(
+ key = c(1,2,2,4),
+ val_y = c("y1","y2","y3","y4")
+ )
> inner_join(x, y, by="key")
# A tibble: 5 × 3
key val_x val_y
<dbl> <chr> <chr>
1 1 x1 y1
2 2 x2 y2
3 2 x2 y3
4 2 x3 y2
5 2 x3 y3
When keys are duplicated, multiple rows can match multiple rows, so every possible combination is produced.
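One way to check the "every combination" rule is to predict the size of the join from the key counts alone. This is only a sketch: `n_x`, `n_y`, and `predicted_rows` are names introduced here for illustration, and the `suppressWarnings()` call is there because newer dplyr versions may warn about many-to-many matches.

```r
library(dplyr)

x <- tibble(key = c(1, 2, 2, 3), val_x = c("x1", "x2", "x3", "x4"))
y <- tibble(key = c(1, 2, 2, 4), val_y = c("y1", "y2", "y3", "y4"))

# Newer dplyr versions may warn about many-to-many matches; expected here
joined <- suppressWarnings(inner_join(x, y, by = "key"))

# Predicted size: for each shared key, (copies in x) * (copies in y), summed
predicted_rows <- count(x, key, name = "n_x") %>%
  inner_join(count(y, key, name = "n_y"), by = "key") %>%
  summarize(total = sum(n_x * n_y)) %>%
  pull(total)

nrow(joined) == predicted_rows
```

Key 1 appears once in each table and key 2 appears twice in each, so the prediction is 1*1 + 2*2 = 5 rows, matching the output above.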
> inner_join(airports, flights, by="origin")
Error: Join columns must be present in data.
x Problem with `origin`.
> inner_join(airports, flights, by=c("faa"="origin"))
# A tibble: 336,776 × 26
faa name lat lon alt tz dst tzone year month day dep_time
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <int> <int> <int> <int>
1 EWR Newark… 40.7 -74.2 18 -5 A Ameri… 2013 1 1 517
2 EWR Newark… 40.7 -74.2 18 -5 A Ameri… 2013 1 1 554
3 EWR Newark… 40.7 -74.2 18 -5 A Ameri… 2013 1 1 555
4 EWR Newark… 40.7 -74.2 18 -5 A Ameri… 2013 1 1 558
5 EWR Newark… 40.7 -74.2 18 -5 A Ameri… 2013 1 1 559
6 EWR Newark… 40.7 -74.2 18 -5 A Ameri… 2013 1 1 601
7 EWR Newark… 40.7 -74.2 18 -5 A Ameri… 2013 1 1 606
8 EWR Newark… 40.7 -74.2 18 -5 A Ameri… 2013 1 1 607
9 EWR Newark… 40.7 -74.2 18 -5 A Ameri… 2013 1 1 608
10 EWR Newark… 40.7 -74.2 18 -5 A Ameri… 2013 1 1 615
# … with 336,766 more rows, and 14 more variables: sched_dep_time <int>,
# dep_delay <dbl>, arr_time <int>, sched_arr_time <int>, arr_delay <dbl>,
# carrier <chr>, flight <int>, tailnum <chr>, dest <chr>, air_time <dbl>,
# distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
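The same named-vector form of `by` can be used twice when one table refers to another in two ways, as flights does with origin and dest. The sketch below uses small made-up tables (`places`, `trips`, and their columns are hypothetical) rather than the real flight data.

```r
library(dplyr)

# Hypothetical lookup table and a table that references it twice
places <- tibble(code = c("AAA", "BBB"), city = c("Alpha", "Beta"))
trips <- tibble(
  trip_id = 1:3,
  start = c("AAA", "BBB", "AAA"),
  end = c("BBB", "AAA", "CCC")
)

trip_cities <- trips %>%
  # left-hand name comes from trips, right-hand name from places
  left_join(places, by = c("start" = "code")) %>%
  # second lookup; suffix separates the two city columns
  left_join(places, by = c("end" = "code"), suffix = c("_start", "_end"))
```

The result has `city_start` and `city_end` columns, and the trip ending at the unknown code "CCC" gets NA for `city_end` because a left join keeps unmatched rows.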
Use joins to find the models of airplane that fly into Seattle Tacoma Intl.
> airports %>%
+ filter(name=="Seattle Tacoma Intl") %>%
+ inner_join(flights, by=c("faa"="dest")) %>%
+ inner_join(planes, by="tailnum") %>%
+ select(model) %>%
+ distinct()
# A tibble: 24 × 1
model
<chr>
1 737-890
2 737-832
3 737-924ER
4 A320-232
5 737-824
6 757-231
7 757-232
8 757-2Q8
9 767-332
10 757-222
# … with 14 more rows
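A left join keeps all observations in x.
A right join keeps all observations in y.
A full join keeps all observations in x and y.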
> flights %>%
+ select(tailnum, year:day, hour, origin) %>%
+ left_join(weather, by=c("year", "month", "day", "hour", "origin")) %>%
+ head(3)
# A tibble: 3 × 16
tailnum year month day hour origin temp dewp humid wind_dir wind_speed
<chr> <int> <int> <int> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 N14228 2013 1 1 5 EWR 39.0 28.0 64.4 260 12.7
2 N24211 2013 1 1 5 LGA 39.9 25.0 54.8 250 15.0
3 N619AA 2013 1 1 5 JFK 39.0 27.0 61.6 260 15.0
# … with 5 more variables: wind_gust <dbl>, precip <dbl>, pressure <dbl>,
# visib <dbl>, time_hour <dttm>
> flights %>%
+ select(tailnum, year:day, hour, origin) %>%
+ rename(departure = origin) %>%
+ left_join(weather, by=c("year", "month", "day", "hour", "departure"="origin"))
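To see how the outer joins differ, it may help to run all three on the small x and y tables from the inner join example. This is a minimal sketch that assumes dplyr is installed; the result names `left`, `right`, and `full` are introduced here.

```r
library(dplyr)

x <- tibble(key = c(1, 2, 3), val_x = c("x1", "x2", "x3"))
y <- tibble(key = c(1, 2, 4), val_y = c("y1", "y2", "y3"))

left  <- left_join(x, y, by = "key")   # every row of x; val_y is NA for key 3
right <- right_join(x, y, by = "key")  # every row of y; val_x is NA for key 4
full  <- full_join(x, y, by = "key")   # every key from either table
```

Each outer join fills the columns from the table without a match with NA, which is why checking for unexpected NAs after a join is a quick way to spot unmatched keys.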
Watch out for NAs in the variables you are going to join on.
anti_join() and semi_join() are useful tools (filtering joins) to diagnose problems.
anti_join() keeps only the rows in x that don't have a match in y.
semi_join() keeps only the rows in x that do have a match in y.
It appears some of the tailnums in flights do not appear in planes. Is there something those flights have in common that might help us diagnose the issue?
> bad_flights_sample = flights %>%
+ anti_join(planes, by="tailnum") %>%
+ sample_n(10)
> bad_flight_carrier_count = flights %>%
+ anti_join(planes, by="tailnum") %>%
+ count(carrier) %>%
+ arrange(desc(n))
> bad_flight_carrier_count
# A tibble: 10 × 2
carrier n
<chr> <int>
1 MQ 25397
2 AA 22558
3 UA 1693
4 9E 1044
5 B6 830
6 US 699
7 FL 187
8 DL 110
9 F9 50
10 WN 38
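The filtering joins are easiest to understand on a tiny example. The sketch below reuses the small x and y tables from the inner join example (assuming dplyr is installed); `matched` and `unmatched` are names introduced here.

```r
library(dplyr)

x <- tibble(key = c(1, 2, 3), val_x = c("x1", "x2", "x3"))
y <- tibble(key = c(1, 2, 4), val_y = c("y1", "y2", "y3"))

# Rows of x whose key also appears in y (no columns from y are added)
matched <- semi_join(x, y, by = "key")

# Rows of x whose key does not appear in y
unmatched <- anti_join(x, y, by = "key")
```

Unlike inner_join(), neither function adds columns from y; they only decide which rows of x to keep. Here `matched` holds keys 1 and 2, and `unmatched` holds key 3.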
count(x) is a shortcut for group_by(x) %>% summarize(n=n()).
Let's compare the counts of airlines with missing planes to the counts of airlines across all flight data.
> flight_carrier_count = flights %>%
+ count(carrier) %>%
+ arrange(desc(n))
> flight_carrier_count
# A tibble: 16 × 2
carrier n
<chr> <int>
1 UA 58665
2 B6 54635
3 EV 54173
4 DL 48110
5 AA 32729
6 MQ 26397
7 US 20536
8 9E 18460
9 WN 12275
10 VX 5162
11 FL 3260
12 AS 714
13 F9 685
14 YV 601
15 HA 342
16 OO 32
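The claim that count() is shorthand for a grouped summary can be checked on a small made-up vector of carrier codes (the `toy` table is hypothetical). This sketch also uses the `sort = TRUE` argument of count(), which gives the same ordering as a following arrange(desc(n)).

```r
library(dplyr)

# Hypothetical toy data: one row per flight
toy <- tibble(carrier = c("AA", "UA", "AA", "DL", "AA", "UA"))

via_count <- toy %>% count(carrier)

via_group <- toy %>%
  group_by(carrier) %>%
  summarize(n = n())

# Same counts either way
identical(as.data.frame(via_count), as.data.frame(via_group))

# sort = TRUE orders by descending n, like arrange(desc(n))
sorted_counts <- toy %>% count(carrier, sort = TRUE)
```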
We can already see the trend, but let's clean it up a bit.
> flight_carrier_count %>%
+ left_join(bad_flight_carrier_count,
+ by="carrier",
+ suffix=c("_all", "_bad")) %>%
+ replace_na(list(n_bad=0)) %>%
+ mutate(bad_ratio = n_bad/n_all) %>%
+ left_join(airlines, by="carrier") %>%
+ ggplot(aes(y=name, x=bad_ratio)) +
+ geom_point()
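The pipeline above combines three ideas: `suffix` to tell apart two columns that are both named n, replace_na() to turn unmatched rows into zeros, and mutate() to compute the ratio. Here is a sketch of just those steps on made-up counts (the numbers and table names are hypothetical), assuming dplyr and tidyr are installed.

```r
library(dplyr)
library(tidyr)

# Hypothetical counts: DL has no "bad" flights, so it is absent from bad_counts
all_counts <- tibble(carrier = c("AA", "UA", "DL"), n = c(10L, 20L, 5L))
bad_counts <- tibble(carrier = c("AA", "UA"), n = c(5L, 1L))

ratios <- all_counts %>%
  # both tables have a column n, so suffix renames them n_all and n_bad
  left_join(bad_counts, by = "carrier", suffix = c("_all", "_bad")) %>%
  # DL had no match, so its n_bad is NA; treat that as zero
  replace_na(list(n_bad = 0L)) %>%
  mutate(bad_ratio = n_bad / n_all)
```

Without the replace_na() step, carriers with no unmatched flights would have an NA ratio instead of 0 and would be dropped from the plot.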
> head(orange, 3)
# A tibble: 3 × 3
Tree age circumference
<ord> <dbl> <dbl>
1 1 118 30
2 1 484 58
3 1 664 87
orange contains data on five trees.
> ggplot(orange, aes(age, circumference)) +
+ geom_point()
> ggplot(orange, aes(age, circumference)) +
+ geom_point(aes(color = Tree))
> ggplot(orange, aes(age, circumference)) +
+ geom_point(aes(color = Tree), size=3)
Aesthetic mappings (inside aes()) tell ggplot how the columns of the data relate to what should go on the plot.
Fixed attributes are set outside aes(), e.g. size=3 in this case.
Aesthetics specified in ggplot() are passed down to the geoms.
> ggplot(orange, aes(age, circumference)) +
+ geom_point(aes(shape = Tree), size=3)
Warning: Using shapes for an ordinal variable is not advised
> ggplot(orange, aes(age, circumference)) +
+ geom_point(aes(color = Tree, shape = Tree), size=3)
Warning: Using shapes for an ordinal variable is not advised
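The difference between mapping an aesthetic to a column and setting it to a fixed value can be sketched side by side. This example assumes `orange` is a tibble version of R's built-in `Orange` dataset, which is an assumption since its definition is not shown in this section; `p_mapped` and `p_fixed` are names introduced here.

```r
library(ggplot2)

# Assumption: orange is the built-in Orange dataset converted to a tibble
orange <- tibble::as_tibble(Orange)

# Mapped: color depends on the Tree column, so ggplot builds a legend
p_mapped <- ggplot(orange, aes(age, circumference)) +
  geom_point(aes(color = Tree), size = 3)

# Fixed: every point is blue and size 3; no legend is needed
p_fixed <- ggplot(orange, aes(age, circumference)) +
  geom_point(color = "blue", size = 3)
```

A common slip is writing `aes(color = "blue")`, which treats "blue" as a data value to be mapped rather than as the color to draw with.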
Reproduce this plot
> ggplot(orange, aes(age, circumference)) +
+ geom_point(aes(color = Tree)) +
+ geom_line(aes(color = Tree))
or
> orange %>%
+ ggplot(aes(age, circumference, color=Tree)) +
+ geom_point() +
+ geom_line()
Aesthetics specified in ggplot() are passed down to the geoms.
> ggplot(orange, aes(age, circumference)) +
+ geom_point() +
+ facet_wrap(~ Tree) +
+ geom_smooth(method = loess, se = FALSE)
`geom_smooth()` using formula 'y ~ x'
Reproduce the following plot:
> ggplot(orange, aes(age, circumference, color=Tree)) +
+ geom_point() +
+ geom_smooth(method = loess, se = FALSE)
`geom_smooth()` using formula 'y ~ x'
> ggplot(orange, aes(age, circumference)) +
+ geom_point(aes(shape=Tree)) +
+ geom_smooth(aes(color=Tree), method = loess, se = FALSE)
Warning: Using shapes for an ordinal variable is not advised
`geom_smooth()` using formula 'y ~ x'
> ggplot(orange, aes(age, circumference)) +
+ geom_point(size = 3) +
+ facet_wrap(~ Tree)
> orange %>%
+ ggplot(aes(age, circumference)) +
+ geom_point() +
+ geom_line() +
+ facet_wrap(~ Tree)
facet_wrap(~ facet_variable): This puts the facets left-to-right, top-to-bottom, wrapping around.
facet_grid(y-variable ~ .): This puts them in a vertically-aligned stack, by value of the y-variable.
facet_grid(. ~ x-variable): This puts them in a horizontally-aligned stack, by value of the x-variable.
> ggplot(orange, aes(age, circumference)) +
+ geom_point() +
+ facet_grid(. ~ Tree)
> ggplot(orange, aes(age, circumference)) +
+ geom_point() +
+ facet_grid(Tree ~ .)
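Two variations on the facet functions above may be useful: controlling how facet_wrap() wraps, and giving facet_grid() a variable on both sides of the formula. This is only a sketch; it assumes `orange` is a tibble version of the built-in `Orange` dataset and uses the built-in `mtcars` data for the two-variable grid, neither of which is defined in this section.

```r
library(ggplot2)

# Assumption: orange is the built-in Orange dataset converted to a tibble
orange <- tibble::as_tibble(Orange)

# ncol controls how many panels appear before wrapping to a new row
p_wrap <- ggplot(orange, aes(age, circumference)) +
  geom_point() +
  facet_wrap(~ Tree, ncol = 2)

# Rows by one variable, columns by another (built-in mtcars data)
p_grid <- ggplot(mtcars, aes(hp, mpg)) +
  geom_point() +
  facet_grid(cyl ~ gear)
```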
> ggplot(orange, aes(age, circumference)) +
+ geom_point() +
+ facet_wrap(~ Tree) +
+ geom_smooth(method = lm, se=FALSE)
`geom_smooth()` using formula 'y ~ x'
> ggplot(state_data, aes(region, area)) +
+ geom_point()
> plot = state_data %>%
+ group_by(region) %>%
+ summarize(
+ region_mean = mean(area),
+ region_sd = sd(area)) %>%
+ ggplot(aes(region, region_mean)) +
+ geom_point()
> plot
> plot +
+ geom_errorbar(aes(ymin = region_mean - region_sd,
+ ymax = region_mean + region_sd),
+ width = 0.3)
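For a version of the error-bar plot that runs on its own, the sketch below builds a stand-in for state_data from R's built-in `state.name`, `state.region`, and `state.area` vectors. That stand-in is an assumption, since the tutorial's own state_data is not defined in this section. The plot is stored as `region_plot` rather than `plot` so that it does not share a name with base R's plot() function.

```r
library(dplyr)
library(ggplot2)

# Assumption: a stand-in for state_data built from base R's state datasets
state_data <- tibble(
  name = state.name,
  region = state.region,
  area = state.area
)

# One row per region with the mean and standard deviation of area
region_summary <- state_data %>%
  group_by(region) %>%
  summarize(region_mean = mean(area),
            region_sd = sd(area))

# Points at the mean, bars one standard deviation either side;
# width is a fixed attribute, so it goes outside aes()
region_plot <- ggplot(region_summary, aes(region, region_mean)) +
  geom_point() +
  geom_errorbar(aes(ymin = region_mean - region_sd,
                    ymax = region_mean + region_sd),
                width = 0.3)
```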
ggplot(...) + geom_point() is a strange expression: it uses the + operator to add things (plots and plot specifications), which are not numbers.
The types of the arguments to + determine which piece of code, called a method, to run.
> p <- ggplot(mtc, aes(hp, mpg))
> l <- geom_point()
> p + l
> l <- geom_point()
> p + l
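The idea that + picks a method based on the types of its arguments can be tried out with a small class of your own. The sketch below defines a made-up "money" class (hypothetical, purely for illustration) and gives it its own + method using R's S3 system. It shows the general mechanism, not ggplot2's actual implementation.

```r
# Hypothetical toy class: a list with an amount, tagged with class "money"
money <- function(amount) {
  structure(list(amount = amount), class = "money")
}

# A method for + that R runs when the arguments have class "money"
`+.money` <- function(e1, e2) {
  money(e1$amount + e2$amount)
}

total <- money(5) + money(7)
total$amount

# Ordinary numbers still use the built-in addition
1 + 2
```

Adding two money objects runs the custom method and returns another money object, while adding two numbers is unaffected.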
The goal of reproducible analysis is to produce a computational artifact that others can view, scrutinize, test, and run, to convince themselves that your ideas are valid. (It's also good for you to be just as skeptical of your own work.) This means you should write code to be run more than once and by others.
Doing so requires being organized in several ways:
R for Data Science (R4DS): https://r4ds.had.co.nz
Cheatsheets: https://www.rstudio.com/resources/cheatsheets/